Data Science with

DAY 2



1 CASE STUDY. A study on the effects of exercise on health with data in four flat files



It is desired to assess the effectiveness of a new exercise programme which is believed to induce weight loss and improve self-rated health. A group of 5000 people were selected and randomized into two groups: 2,500 people to the treatment group (which receives the exercise intervention) and another 2,500 to the control group (does not receive the exercise intervention). Health outcomes, weight and self-rated health, were measured before the start of the intervention and immediately after the intervention.

The data was provided in 4 files:

*EXER_age_sex_race.csv : Subject demographics at baseline: age, sex, and race (we have seen this data when learning how to import comma separated values files)

*EXER_SRH.csv : Self-Rated Health (SRH) on an ordinal scale.

*EXER_weight_trt : Weight, in pounds, for Treatment Group.

*EXER_weight_con : Weight, in pounds, for Control Group.

Let us import these files using the read_csv() function and explore the variables they contain.

#load the libraries needed
library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(ggmosaic)
library(forcats)
data1 <- read_csv("DataFiles/EXER_age_sex_race.csv")
## Parsed with column specification:
## cols(
##   subject_ID = col_integer(),
##   SexAge_Race = col_character()
## )
glimpse(data1)
## Observations: 5,000
## Variables: 2
## $ subject_ID  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ SexAge_Race <chr> "MALE41.2_White", "FEMALE42.9_White", "FEMALE38.5_...

We can see that there are 5000 subjects in the study and all the information about each subject is in one column. So, we will need to separate this information. Also, we can see that there are missing values for race (people can opt out from disclosing this type of information).



data2 <- read_csv("DataFiles/EXER_SRH.csv")
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   trt = col_integer(),
##   TIME = col_character(),
##   SRH = col_character()
## )
glimpse(data2)
## Observations: 10,035
## Variables: 4
## $ id   <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH  <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
unique(as.factor(data2$trt))
## [1] 1 0
## Levels: 0 1

This file contains the patient id (note the different name from the previous file), each participant’s self rated health before and after the study. The column trt can take on two values: 0 (individual didn’t exercise), 1 (individual exercised). Notice that data2 has 10035 rows but there should only be 10000 rows. This needs to be investigated.

data3 <- read_csv("DataFiles/EXER_weight_trt.csv")
## Parsed with column specification:
## cols(
##   Id = col_integer(),
##   PRE_WEIGHT = col_double(),
##   POST_WEIGHT = col_double()
## )
glimpse(data3)
## Observations: 5,017
## Variables: 3
## $ Id          <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT  <dbl> 135.2510, NA, 154.8713, NA, 128.1951, NA, 183.4600...
## $ POST_WEIGHT <dbl> NA, 125.6678, NA, 153.9882, NA, 115.5969, NA, 177....

This data set contains information about the weight of patients who received the exercise plan. The patient id, again a different name than in the previous two data sets, pre-intervention weight and post-intervention weight. Note that many NAs are generated: if pre-intervention weight is recorded, post-intervention weight is an NA, and vice-versa. Also, we expect 5000 observations, instead there are 5017.



data4 <- read_csv("DataFiles/EXER_weight_con.csv")
## Parsed with column specification:
## cols(
##   obs_ID = col_integer(),
##   PRE_WEIGHT = col_double(),
##   POST_WEIGHT = col_double()
## )
glimpse(data4)
## Observations: 5,012
## Variables: 3
## $ obs_ID      <int> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT  <dbl> 159.7587, NA, 176.1611, NA, 181.3907, NA, 175.6615...
## $ POST_WEIGHT <dbl> NA, 158.6920, NA, 174.8270, NA, 179.9042, NA, 175....

This is information about the patients in the control group (didn’t follow the exercise programme). Again, the patient id column has a different name in this file and pre- and post- intervention weight are recorded in similarly as in the file for the treatment group. We expect 5000 observations, instead there are 5012.



In this study an observational unit is a patient, fixed variables are age, sex and race. Measured variables are self-rated health and weight at two time points, before and after the intervention.


This is an example of messy data where many features of a messy data set can already be observed. Firstly a single observational unit is stored in multiple tables, multiple variables are stored in one column, there is possibly duplicated information, and more messy features to be discovered.


The strategy is to deal with the data sets one by one and manipulate each of them so that they can be joined. We will have to join by subject id so let us name that column id in all four data frames.

In the first instance we want a tibble with one row per patient.

1.1 First data set (data1)

glimpse(data1)
## Observations: 5,000
## Variables: 2
## $ subject_ID  <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ SexAge_Race <chr> "MALE41.2_White", "FEMALE42.9_White", "FEMALE38.5_...
# data1_1 will be the modified and tidy version of data1. Keep data1 for reference.
data1_1 <- data1 

First of all let us rename the first column

# rename the first column
data1_1 <- rename(data1_1, id = subject_ID) 
# add a _ after MALE to separate sex and age
data1_1 <- mutate(data1_1, SexAge_Race = str_replace_all(SexAge_Race, "MALE", "MALE_")) 
# separate values between _
data1_1 <- separate(data1_1, SexAge_Race, c("Sex", "Age", "Race"), sep = "_")
glimpse(data1_1)
## Observations: 5,000
## Variables: 4
## $ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex  <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age  <chr> "41.2", "42.9", "38.5", "35.6", "48.5", "36.9", "28.9", "...
## $ Race <chr> "White", "White", "White", "Hispanic", "White", "NA", "Wh...
data1_1 <- mutate(data1_1, Age = as.numeric(Age)) # we make Age a numeric variable
glimpse(data1_1)
## Observations: 5,000
## Variables: 4
## $ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex  <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age  <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 37....
## $ Race <chr> "White", "White", "White", "Hispanic", "White", "NA", "Wh...

Now note that the Race column of data1_1 has the value “NA”, which R reads as a character string and not as a missing value. We need to transform the entries “NA” into NA.

data1_1 <- mutate(data1_1, Race = na_if(Race, "NA"))
glimpse(data1_1)
## Observations: 5,000
## Variables: 4
## $ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex  <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age  <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 37....
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA, "Whit...


Summary of the wrangling:

data1_tidy <- read_csv("DataFiles/EXER_age_sex_race.csv") %>%
  rename(id = subject_ID) %>%
  mutate(SexAge_Race = 
           str_replace_all(SexAge_Race, "MALE", "MALE_")) %>%
  separate(SexAge_Race, c("Sex","Age","Race"), sep = "_") %>%
  mutate(Age = as.numeric(Age)) %>% 
  mutate(Race = na_if(Race, "NA"))
## Parsed with column specification:
## cols(
##   subject_ID = col_integer(),
##   SexAge_Race = col_character()
## )



1.2 Second data set (data2)

Firstly, we must resolve the issue that there are 10,035 rows when we expect 10,000. We will first attempt to remove exact duplicate rows.

glimpse(data2)
## Observations: 10,035
## Variables: 4
## $ id   <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH  <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
data2_1 <- data2#we keep data2 intact for reference. Changes will go into data2_1
data2_1 <- distinct(data2)#remove duplicate rows
glimpse(data2_1)
## Observations: 10,016
## Variables: 4
## $ id   <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH  <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...

After removing duplicate rows now we have 10,016 rows. We will obtain summaries of the columns to see if we can spot something unusual.

summary(as.factor(data2_1$trt))
##    0    1 
## 5008 5008
summary(as.factor(data2_1$TIME))
## POST  PRE 
## 5003 5013
summary(as.factor(data2_1$SRH))
##    Excellent         Good         Poor Satisfactory   Very  Poor 
##         2947         1570         1711         1411           16 
##    Very Poor 
##         2361

We spotted something unusual! The SRH level Very Poor appears also with a double space between Very and Poor. This happens for 16 rows.

data2_1 <- mutate(data2_1, SRH = str_replace_all(SRH, "Very  Poor", "Very Poor"))
glimpse(data2_1)
## Observations: 10,016
## Variables: 4
## $ id   <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH  <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...
summary(as.factor(data2_1$SRH))
##    Excellent         Good         Poor Satisfactory    Very Poor 
##         2947         1570         1711         1411         2377
data2_1 <- distinct(data2_1)#remove duplicate rows
glimpse(data2_1)
## Observations: 10,000
## Variables: 4
## $ id   <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10,...
## $ trt  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ TIME <chr> "POST", "PRE", "PRE", "POST", "PRE", "POST", "PRE", "POST...
## $ SRH  <chr> "Poor", "Good", "Poor", "Very Poor", "Satisfactory", "Goo...

We have two rows per patient. We will spread the TIME column, to create two new columns with PRE-SRH and POST-SRH. Remember that eventually we have to merge all four data sets into one. The first data set has one row per observation and we will follow that format.

data2_1 <- spread(data2_1, TIME, SRH)
glimpse(data2_1)
## Observations: 5,000
## Variables: 4
## $ id   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ trt  <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ POST <chr> "Poor", "Very Poor", "Good", "Good", "Poor", "Excellent",...
## $ PRE  <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good", "...

We change the names of the 3rd and 4th columns to POST_SRH and PRE_SRH and swap their order.

data2_1 <- rename(data2_1, POST_SRH = POST, PRE_SRH = PRE)
data2_1 <- select(data2_1, c(1,2,4,3))
glimpse(data2_1)
## Observations: 5,000
## Variables: 4
## $ id       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ trt      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ PRE_SRH  <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good...
## $ POST_SRH <chr> "Poor", "Very Poor", "Good", "Good", "Poor", "Excelle...


Summarising the tidying process:

data2_tidy <- read_csv("DataFiles/EXER_SRH.csv") %>%
  mutate(SRH = str_replace_all(SRH, "Very  Poor", "Very Poor")) %>%
  distinct() %>%
  spread(TIME, SRH) %>%
  rename(POST_SRH = POST, PRE_SRH = PRE) %>%
  select(c(1,2,4,3))
## Parsed with column specification:
## cols(
##   id = col_integer(),
##   trt = col_integer(),
##   TIME = col_character(),
##   SRH = col_character()
## )

1.3 Third data set (data3)

Here we have the pre- and post-weights of the patients who received the exercise plan treatment. We will change the name of the first column to id, in line with the two previous data sets, keeping in mind we need to merge the data sets once they are in suitable form and tidied up.

glimpse(data3)
## Observations: 5,017
## Variables: 3
## $ Id          <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT  <dbl> 135.2510, NA, 154.8713, NA, 128.1951, NA, 183.4600...
## $ POST_WEIGHT <dbl> NA, 125.6678, NA, 153.9882, NA, 115.5969, NA, 177....
head(data3)
## # A tibble: 6 x 3
##      Id PRE_WEIGHT POST_WEIGHT
##   <int>      <dbl>       <dbl>
## 1     1       135.         NA 
## 2     1        NA         126.
## 3     2       155.         NA 
## 4     2        NA         154.
## 5     3       128.         NA 
## 6     3        NA         116.
data3_1 <- data3
data3_1 <- rename(data3_1, id = Id)

There should be 5,000 rows in data3 but there are 5,017 rows. Note the very inefficient way the data is recorded. There should be just one row per patient. Instead there are two and lots of NAs.

#remove duplicate rows
data3_1 <- distinct(data3_1)
glimpse(data3_1)
## Observations: 5,017
## Variables: 3
## $ id          <int> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT  <dbl> 135.2510, NA, 154.8713, NA, 128.1951, NA, 183.4600...
## $ POST_WEIGHT <dbl> NA, 125.6678, NA, 153.9882, NA, 115.5969, NA, 177....

There are no exact duplicate rows in data3. Let us see if there are more than two records per patient. The function count in the dplyr package can be used to count how many instances for each patient id there are in the data set. For example:

aux1 <- count(data3_1[1:10,],id)#counts how many instances of each patient id there are in the first twenty rows
aux1
## # A tibble: 5 x 2
##      id     n
##   <int> <int>
## 1     1     2
## 2     2     2
## 3     3     2
## 4     4     2
## 5     5     2

The result has two columns. One is id and the other one is the corresponding count, n.

aux1 <- count(data3_1,id)#counts how many instances of each patient id there are in the whole data set
id_trouble <- aux1 %>% 
  filter(n > 2) %>% 
  select(id) %>% 
  pull(id)
id_trouble
## [1] 2394 2395 2396 2397 2398 2399 2400 2401 2402

These are the patient id’s with more than two entries. Let us explore the data for those patients.

ff <- filter(data3_1, id %in% id_trouble)
as.data.frame(ff)
##      id PRE_WEIGHT POST_WEIGHT
## 1  2394   137.0663          NA
## 2  2394         NA    128.2492
## 3  2394         NA    128.2000
## 4  2395   164.3000          NA
## 5  2395   164.3255          NA
## 6  2395         NA    164.3350
## 7  2395         NA    164.3500
## 8  2396   160.9983          NA
## 9  2396   160.9985          NA
## 10 2396         NA    165.7400
## 11 2396         NA    165.7270
## 12 2397   147.7983          NA
## 13 2397   147.7990          NA
## 14 2397         NA    137.8439
## 15 2397         NA    137.8500
## 16 2398   133.1364          NA
## 17 2398   133.1400          NA
## 18 2398         NA    118.9500
## 19 2398         NA    118.9398
## 20 2399   188.8684          NA
## 21 2399   188.9000          NA
## 22 2399         NA    179.1000
## 23 2399         NA    179.0564
## 24 2400   166.1632          NA
## 25 2400   166.2000          NA
## 26 2400         NA    150.2311
## 27 2400         NA    150.2000
## 28 2401   151.6768          NA
## 29 2401   151.7000          NA
## 30 2401         NA    133.3000
## 31 2401         NA    133.2508
## 32 2402   148.7000          NA
## 33 2402   148.7475          NA
## 34 2402         NA    136.2554
## 35 2402         NA    136.6000

So, we see that for these patients there are double entries, sometimes for the pre-weight, post-weight, or both. The strategy we will use to deal with this is to truncate the values of weight (pre- and post-intervention). Truncating a number means retaining the integer part of it only. So, for example, truncating 1.2 and 1.9 gives the same result, 1. This will hardly affect the results of the study. Truncating the weight data will create exact row duplicates and we will then eliminate duplicates. The function mutate_all() in the dplyr package is useful to apply a function to each column. In this case the function we want to apply is trunc().

data3_1 <- mutate_all(data3_1, trunc)#this call truncates all the columns to the nearest integer
glimpse(data3_1)
## Observations: 5,017
## Variables: 3
## $ id          <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT  <dbl> 135, NA, 154, NA, 128, NA, 183, NA, 166, NA, 120, ...
## $ POST_WEIGHT <dbl> NA, 125, NA, 153, NA, 115, NA, 177, NA, 163, NA, 1...

Now, we remove duplicate rows:

data3_1 <- distinct(data3_1)
glimpse(data3_1)
## Observations: 5,000
## Variables: 3
## $ id          <dbl> 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9,...
## $ PRE_WEIGHT  <dbl> 135, NA, 154, NA, 128, NA, 183, NA, 166, NA, 120, ...
## $ POST_WEIGHT <dbl> NA, 125, NA, 153, NA, 115, NA, 177, NA, 163, NA, 1...

Now we have 5000 rows but we need to have one row per patient with their id, pre-weight and post-weight.

The strategy to achieve this is to select id and pre-weight, eliminate the rows where pre-weight is NA. Next select id and post-weight and eliminate the rows where post-weight value is NA. Next merge both data frames into one by patient id.

temp1 <- select(data3_1, -POST_WEIGHT)#select all columns except POST_WEIGHT
temp1 <- filter(temp1, is.na(PRE_WEIGHT) == "FALSE")#filter out the NAs in PRE_WEIGHT
glimpse(temp1)
## Observations: 2,500
## Variables: 2
## $ id         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ PRE_WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 1...
temp2 <- select(data3_1, -PRE_WEIGHT)#select all columns except PRE_WEIGHT
temp2 <- filter(temp2, is.na(POST_WEIGHT) == "FALSE")#filter out the NAs in POST_WEIGHT
glimpse(temp2)
## Observations: 2,500
## Variables: 2
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
data3_1 <- inner_join(temp1,temp2)
## Joining, by = "id"
glimpse(data3_1)
## Observations: 2,500
## Variables: 3
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...


Summary of the tidying process using the %>% operator

data3_aux <- read_csv("DataFiles/EXER_weight_trt.csv") %>%
  rename(id = Id) %>%
  mutate_all(trunc) %>%
  distinct()
## Parsed with column specification:
## cols(
##   Id = col_integer(),
##   PRE_WEIGHT = col_double(),
##   POST_WEIGHT = col_double()
## )
temp1 <- data3_aux %>% 
  select(-POST_WEIGHT) %>%
  filter(is.na(PRE_WEIGHT) == "FALSE")
temp2 <- data3_aux %>% 
  select(-PRE_WEIGHT) %>%
  filter(is.na(POST_WEIGHT) == "FALSE")
data3_tidy <- inner_join(temp1,temp2)
## Joining, by = "id"



1.4 Fourth data set (data4)

glimpse(data4)
## Observations: 5,012
## Variables: 3
## $ obs_ID      <int> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT  <dbl> 159.7587, NA, 176.1611, NA, 181.3907, NA, 175.6615...
## $ POST_WEIGHT <dbl> NA, 158.6920, NA, 174.8270, NA, 179.9042, NA, 175....
head(data4)
## # A tibble: 6 x 3
##   obs_ID PRE_WEIGHT POST_WEIGHT
##    <int>      <dbl>       <dbl>
## 1   2501       160.         NA 
## 2   2501        NA         159.
## 3   2502       176.         NA 
## 4   2502        NA         175.
## 5   2503       181.         NA 
## 6   2503        NA         180.
data4_1 <- data4
data4_1 <- rename(data4_1, id = obs_ID)

There are 5,012 rows in this data set when there should be just 5,000. Similar inefficient way of recording information as in previous data set.

Let us first investigate if there are patient with duplicate records.

data4_1 <- distinct(data4_1)
glimpse(data4_1)
## Observations: 5,012
## Variables: 3
## $ id          <int> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT  <dbl> 159.7587, NA, 176.1611, NA, 181.3907, NA, 175.6615...
## $ POST_WEIGHT <dbl> NA, 158.6920, NA, 174.8270, NA, 179.9042, NA, 175....

There aren’t any duplicate rows.

Let us investigate if any patients have duplicate records.

aux1 <- count(data3_1,id)#counts how many instances of each patient id there are in the whole data set
id_trouble <- aux1 %>% filter(n > 2) %>% select(id) %>% pull(id)
id_trouble
## numeric(0)

Patients 4980, 4981, 4982, 4983, 4984 and 4985 have more than two records each.

filter(data4_1, id %in% id_trouble)
## # A tibble: 0 x 3
## # ... with 3 variables: id <int>, PRE_WEIGHT <dbl>, POST_WEIGHT <dbl>

As before, we truncate the data and eliminate any duplicate rows.

data4_1 <- mutate_all(data4_1, trunc)#this call truncates all the columns to 1 decimal place
data4_1 <- distinct(data4_1)#eliminate duplicate rows
glimpse(data4_1)
## Observations: 5,000
## Variables: 3
## $ id          <dbl> 2501, 2501, 2502, 2502, 2503, 2503, 2504, 2504, 25...
## $ PRE_WEIGHT  <dbl> 159, NA, 176, NA, 181, NA, 175, NA, 136, NA, 160, ...
## $ POST_WEIGHT <dbl> NA, 158, NA, 174, NA, 179, NA, 175, NA, 137, NA, 1...

We have 5,000 rows now.

We have to create a data frame with one row per patient, as before.

temp1 <- select(data4_1, -POST_WEIGHT)#select all columns except POST_WEIGHT
temp1 <- filter(temp1, is.na(PRE_WEIGHT) == "FALSE")#filter out the NAs in PRE_WEIGHT
glimpse(temp1)
## Observations: 2,500
## Variables: 2
## $ id         <dbl> 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 250...
## $ PRE_WEIGHT <dbl> 159, 176, 181, 175, 136, 160, 153, 163, 161, 178, 1...
#
temp2 <- select(data4_1, -PRE_WEIGHT)#select all columns except PRE_WEIGHT
temp2 <- filter(temp2, is.na(POST_WEIGHT) == "FALSE")#filter out the NAs in POST_WEIGHT
glimpse(temp2)
## Observations: 2,500
## Variables: 2
## $ id          <dbl> 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 25...
## $ POST_WEIGHT <dbl> 158, 174, 179, 175, 137, 159, 155, 164, 160, 177, ...
#
data4_1 <- inner_join(temp1,temp2)
## Joining, by = "id"
glimpse(data4_1)
## Observations: 2,500
## Variables: 3
## $ id          <dbl> 2501, 2502, 2503, 2504, 2505, 2506, 2507, 2508, 25...
## $ PRE_WEIGHT  <dbl> 159, 176, 181, 175, 136, 160, 153, 163, 161, 178, ...
## $ POST_WEIGHT <dbl> 158, 174, 179, 175, 137, 159, 155, 164, 160, 177, ...



EXERCISE. Produce an object data4_tidy which is the tidy version of data4. Do not create any other objects besides data4_tidy. Hint: use the %>% operator



Next we are going to join the rows of data3_1 and data4_1 to create data3_2 so that we have all patients in one data frame. First we are going to add a column in data3_1 called trt with all values equal to 1, and a column in data4_1 called trt as well but with all values equal to zero.

data3_1 <- mutate(data3_1, trt = 1)
data4_1 <- mutate(data3_1, trt = 0)
head(data3_1)
## # A tibble: 6 x 4
##      id PRE_WEIGHT POST_WEIGHT   trt
##   <dbl>      <dbl>       <dbl> <dbl>
## 1     1        135         125     1
## 2     2        154         153     1
## 3     3        128         115     1
## 4     4        183         177     1
## 5     5        166         163     1
## 6     6        120         105     1
head(data4_1)
## # A tibble: 6 x 4
##      id PRE_WEIGHT POST_WEIGHT   trt
##   <dbl>      <dbl>       <dbl> <dbl>
## 1     1        135         125     0
## 2     2        154         153     0
## 3     3        128         115     0
## 4     4        183         177     0
## 5     5        166         163     0
## 6     6        120         105     0
data3_2 <- bind_rows(data3_1,data4_1)
glimpse(data3_2)
## Observations: 5,000
## Variables: 4
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
## $ trt         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...

1.5 A unified data set

Now we must merge data1_1, data2_1 and data3_2.

data_exer1 <- inner_join(data1_1, data2_1)
## Joining, by = "id"
data_exer1 <- inner_join(data_exer1, data3_2)
## Joining, by = c("id", "trt")
glimpse(data_exer1)
## Observations: 2,500
## Variables: 9
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex         <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "F...
## $ Age         <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Race        <chr> "White", "White", "White", "Hispanic", "White", NA...
## $ trt         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ PRE_SRH     <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "G...
## $ POST_SRH    <chr> "Poor", "Very Poor", "Good", "Good", "Poor", "Exce...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...

We can further narrow the number of variables introducing a variable Time, with values PRE and POST, and gather SRH and Weight. To do that we will first create one data frame with all the fixed variables and pre- and post-SRH and another data frame with all the fixed variables and pre- and post-WEIGHT. We will gather the pre- and post- column into Time and SRH or WEIGHT and then we will join both data sets.

temp1 <- select(data_exer1, 1:7)
temp1 <- gather(temp1, "Time", "SRH", 6:7)
temp1 <- mutate(temp1, Time = str_replace_all(Time, "PRE_SRH", "PRE"))
temp1 <- mutate(temp1, Time = str_replace_all(Time, "POST_SRH", "POST"))
glimpse(temp1)
## Observations: 5,000
## Variables: 7
## $ id   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17...
## $ Sex  <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE",...
## $ Age  <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 37....
## $ Race <chr> "White", "White", "White", "Hispanic", "White", NA, "Whit...
## $ trt  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Time <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "...
## $ SRH  <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good", "...
temp2 <- select(data_exer1, c(1:5,8,9))
temp2 <- gather(temp2, "Time", "WEIGHT", 6:7)
temp2 <- mutate(temp2, Time = str_replace_all(Time, "PRE_WEIGHT", "PRE"))
temp2 <- mutate(temp2, Time = str_replace_all(Time, "POST_WEIGHT", "POST"))
glimpse(temp2)
## Observations: 5,000
## Variables: 7
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex    <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
## $ Age    <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race   <chr> "White", "White", "White", "Hispanic", "White", NA, "Wh...
## $ trt    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Time   <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE",...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...

We now merge temp1 and temp2

data_exer2 <- inner_join(temp1, temp2)
## Joining, by = c("id", "Sex", "Age", "Race", "trt", "Time")
glimpse(data_exer2)
## Observations: 5,000
## Variables: 8
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex    <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
## $ Age    <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race   <chr> "White", "White", "White", "Hispanic", "White", NA, "Wh...
## $ trt    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Time   <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE",...
## $ SRH    <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good",...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...


Let us summarise the creation of data_exer2 using the %>% operator.

temp1 <- select(data_exer1, 1:7) %>% 
  gather("Time", "SRH", 6:7) %>% 
  mutate(Time = str_replace_all(Time, "PRE_SRH", "PRE")) %>% 
  mutate(Time = str_replace_all(Time, "POST_SRH", "POST"))
temp2 <- select(data_exer1, c(1:5,8,9)) %>% 
  gather("Time", "WEIGHT", 6:7) %>%
  mutate(Time = str_replace_all(Time, "PRE_WEIGHT", "PRE")) %>%
  mutate(Time = str_replace_all(Time, "POST_WEIGHT", "POST"))
data_exer2 <- inner_join(temp1, temp2)
## Joining, by = c("id", "Sex", "Age", "Race", "trt", "Time")
glimpse(data_exer2)
## Observations: 5,000
## Variables: 8
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex    <chr> "MALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE", "FEMALE...
## $ Age    <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race   <chr> "White", "White", "White", "Hispanic", "White", NA, "Wh...
## $ trt    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Time   <chr> "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE", "PRE",...
## $ SRH    <chr> "Good", "Poor", "Satisfactory", "Poor", "Poor", "Good",...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...

Both data_exer1 and data_exer2 are tidy, and it is a matter of preference and how the data will be used to choose a format to work with.

We will work with both data frames for the purpose of visualising the data.

The variable SRH is an ordinal categorical variable. Let us tell R about the ordinal features of SRH. Also, Time is categorical ordinal. Sex and Race are simply categorical. Treatment, trt, is also a factor and we will let R know and re-label the levels from 0 to Control and 1 to Treatment. We will do this for both data frames data_exer1 and data_exer2.


We introduce now a new Tidyverse package called forcats which is designed to deal with factors.



srhlev <- c("Very Poor", "Poor", "Satisfactory", "Good", "Excellent")
data_exer1 <- data_exer1 %>% 
  mutate(trt = factor(trt), PRE_SRH = factor(PRE_SRH), POST_SRH = factor(POST_SRH), Sex = factor(Sex), Race = factor(Race)) %>%
  mutate(trt = fct_recode(trt, Control = "0", Treatment = "1")) %>%
  mutate(PRE_SRH = fct_relevel(PRE_SRH, srhlev)) %>%
  mutate(POST_SRH = fct_relevel(POST_SRH, srhlev)) 
## Warning: Unknown levels in `f`: 0
glimpse(data_exer1)
## Observations: 2,500
## Variables: 9
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex         <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age         <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Race        <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt         <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH     <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH    <fct> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
summary(data_exer1)
##        id             Sex            Age              Race     
##  Min.   :   1.0   FEMALE:1271   Min.   :18.10   Asian   : 224  
##  1st Qu.: 625.8   MALE  :1229   1st Qu.:33.70   Black   : 232  
##  Median :1250.5                 Median :38.90   Hispanic: 414  
##  Mean   :1250.5                 Mean   :38.94   White   :1436  
##  3rd Qu.:1875.2                 3rd Qu.:44.12   NA's    : 194  
##  Max.   :2500.0                 Max.   :69.30                  
##         trt               PRE_SRH            POST_SRH     PRE_WEIGHT   
##  Treatment:2500   Very Poor   :663   Very Poor   :424   Min.   : 82.0  
##                   Poor        :496   Poor        :267   1st Qu.:144.0  
##                   Satisfactory:296   Satisfactory:441   Median :159.0  
##                   Good        :371   Good        :481   Mean   :159.3  
##                   Excellent   :674   Excellent   :887   3rd Qu.:174.0  
##                                                         Max.   :242.0  
##   POST_WEIGHT   
##  Min.   : 78.0  
##  1st Qu.:135.0  
##  Median :150.0  
##  Mean   :150.3  
##  3rd Qu.:164.0  
##  Max.   :221.0

Now for data_exer2, we also set the order of the levels of Time to be PRE and POST.

data_exer2 <- data_exer2 %>% 
  mutate(trt = factor(trt), SRH = factor(SRH), Sex = factor(Sex), Race = factor(Race), Time = factor(Time)) %>%
  mutate(trt = fct_recode(trt, Control = "0", Treatment = "1")) %>%
  mutate(SRH = fct_relevel(SRH, srhlev)) %>%
  mutate(Time = fct_relevel(Time, "PRE", "POST")) 
## Warning: Unknown levels in `f`: 0
glimpse(data_exer2)
## Observations: 5,000
## Variables: 8
## $ id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Sex    <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, M...
## $ Age    <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, 3...
## $ Race   <fct> White, White, White, Hispanic, White, NA, White, Hispan...
## $ trt    <fct> Treatment, Treatment, Treatment, Treatment, Treatment, ...
## $ Time   <fct> PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, ...
## $ SRH    <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excellent, ...
## $ WEIGHT <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149, ...


The variable Age is of a continuous nature. However, it is not really expected that at each small change in age we would observe changes in treatment effects. Also, in case effects change according to age, a change of, say 5 years, at young age will not possibly see the same effect as a change of 5 years at middle or old age. Therefore, we will consider age groups rather than a continuous range of age. Finding optimal age groups is in itself a topic that can be discussed at length. For the purposes of exploration, we will transform the age variable into categorical type with groups. Note that we will use Age from data_exer1 because it contains just one row per observational unit.

summary(data_exer1$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.10   33.70   38.90   38.94   44.12   69.30
aux <- mutate(data_exer1, cut(Age, c(18, 20, 30, 40, 50, 60,70)))
hist(data_exer1$Age)

summary(aux)
##        id             Sex            Age              Race     
##  Min.   :   1.0   FEMALE:1271   Min.   :18.10   Asian   : 224  
##  1st Qu.: 625.8   MALE  :1229   1st Qu.:33.70   Black   : 232  
##  Median :1250.5                 Median :38.90   Hispanic: 414  
##  Mean   :1250.5                 Mean   :38.94   White   :1436  
##  3rd Qu.:1875.2                 3rd Qu.:44.12   NA's    : 194  
##  Max.   :2500.0                 Max.   :69.30                  
##         trt               PRE_SRH            POST_SRH     PRE_WEIGHT   
##  Treatment:2500   Very Poor   :663   Very Poor   :424   Min.   : 82.0  
##                   Poor        :496   Poor        :267   1st Qu.:144.0  
##                   Satisfactory:296   Satisfactory:441   Median :159.0  
##                   Good        :371   Good        :481   Mean   :159.3  
##                   Excellent   :674   Excellent   :887   3rd Qu.:174.0  
##                                                         Max.   :242.0  
##   POST_WEIGHT    cut(Age, c(18, 20, 30, 40, 50, 60, 70))
##  Min.   : 78.0   (18,20]:  16                           
##  1st Qu.:135.0   (20,30]: 319                           
##  Median :150.0   (30,40]:1057                           
##  Mean   :150.3   (40,50]: 911                           
##  3rd Qu.:164.0   (50,60]: 183                           
##  Max.   :221.0   (60,70]:  14

We see that most of the participants are between 30 and 50 years of age. No participants are younger than 18 years or older than 70 years. We will split age into 18-34yrs, 35-40yrs, 41-45yrs, 46-70yrs, so that we don’t have an age category over-represented with number of patients. We create a new column in both data_exer1 and data_exer2 called Age_cat which indicates the age group of the patient.

data_exer1 <- mutate(data_exer1, Age_cat = cut(data_exer1$Age, c(18, 34, 39, 44, 70)))
data_exer2 <- mutate(data_exer2, Age_cat = cut(data_exer2$Age, c(18, 34, 39, 44, 70)))
summary(data_exer1$Age_cat)
## (18,34] (34,39] (39,44] (44,70] 
##     655     614     597     634

Let us see how the data looks like now

glimpse(data_exer1)
## Observations: 2,500
## Variables: 10
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex         <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age         <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Race        <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt         <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH     <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH    <fct> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
## $ Age_cat     <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,3...
glimpse(data_exer2)
## Observations: 5,000
## Variables: 9
## $ id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Sex     <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, ...
## $ Age     <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, ...
## $ Race    <fct> White, White, White, Hispanic, White, NA, White, Hispa...
## $ trt     <fct> Treatment, Treatment, Treatment, Treatment, Treatment,...
## $ Time    <fct> PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE,...
## $ SRH     <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excellent,...
## $ WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149,...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,39], ...

Let us shuffle the columns so that Age and Age_cat are contiguous.

data_exer1 <- select(data_exer1, c(1:3,10,4:9))
data_exer2 <- select(data_exer2, c(1:3,9,4:8))
glimpse(data_exer1)
## Observations: 2,500
## Variables: 10
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex         <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age         <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Age_cat     <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,3...
## $ Race        <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt         <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH     <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH    <fct> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
glimpse(data_exer2)
## Observations: 5,000
## Variables: 9
## $ id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,...
## $ Sex     <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, ...
## $ Age     <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47.1, ...
## $ Age_cat <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,39], ...
## $ Race    <fct> White, White, White, Hispanic, White, NA, White, Hispa...
## $ trt     <fct> Treatment, Treatment, Treatment, Treatment, Treatment,...
## $ Time    <fct> PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE, PRE,...
## $ SRH     <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excellent,...
## $ WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, 149,...
summary(data_exer2$trt)
## Treatment 
##      5000

1.6 Tabular visualisation of data

Using dplyr and tidyr we can use the functions group_by() and summarise()

data_exer2 %>% group_by(SRH, Time) %>% summarise(number = n())
## # A tibble: 10 x 3
## # Groups:   SRH [?]
##    SRH          Time  number
##    <fct>        <fct>  <int>
##  1 Very Poor    PRE      663
##  2 Very Poor    POST     424
##  3 Poor         PRE      496
##  4 Poor         POST     267
##  5 Satisfactory PRE      296
##  6 Satisfactory POST     441
##  7 Good         PRE      371
##  8 Good         POST     481
##  9 Excellent    PRE      674
## 10 Excellent    POST     887

To get a proper cross-tabulation use spread()

data_exer2 %>% group_by(SRH, Time) %>% summarise(number = n()) %>%
spread(key = Time, value = number)
## # A tibble: 5 x 3
## # Groups:   SRH [5]
##   SRH            PRE  POST
##   <fct>        <int> <int>
## 1 Very Poor      663   424
## 2 Poor           496   267
## 3 Satisfactory   296   441
## 4 Good           371   481
## 5 Excellent      674   887

Let us see a few more effective ways (using base R) in which we can tabulate some aspects of the data. We can use table(), ftable() (more than two variables) or xtabs().

with(data_exer2, table(SRH, Time))
##               Time
## SRH            PRE POST
##   Very Poor    663  424
##   Poor         496  267
##   Satisfactory 296  441
##   Good         371  481
##   Excellent    674  887

Without considering any other factor, there are less patients in the Poor and Very Poor categories after the intervention, and there are more patients in the Satisfactory, Good and Excellent categories after the intervention than before the intervention.

Let us see if the trend stays when we consider also the patient’s gender.

aux2 <- with(data_exer2, table(SRH, Time, Sex))
aux2
## , , Sex = FEMALE
## 
##               Time
## SRH            PRE POST
##   Very Poor    343  263
##   Poor         228  193
##   Satisfactory 168  180
##   Good         191  187
##   Excellent    341  448
## 
## , , Sex = MALE
## 
##               Time
## SRH            PRE POST
##   Very Poor    320  161
##   Poor         268   74
##   Satisfactory 128  261
##   Good         180  294
##   Excellent    333  439

or alternatively,

aux3 <- with(data_exer2, ftable(SRH, Time, Sex))
aux3
##                   Sex FEMALE MALE
## SRH          Time                
## Very Poor    PRE         343  320
##              POST        263  161
## Poor         PRE         228  268
##              POST        193   74
## Satisfactory PRE         168  128
##              POST        180  261
## Good         PRE         191  180
##              POST        187  294
## Excellent    PRE         341  333
##              POST        448  439

or

aux4 <- xtabs(~SRH + Time + Sex, data = data_exer2)
aux4
## , , Sex = FEMALE
## 
##               Time
## SRH            PRE POST
##   Very Poor    343  263
##   Poor         228  193
##   Satisfactory 168  180
##   Good         191  187
##   Excellent    341  448
## 
## , , Sex = MALE
## 
##               Time
## SRH            PRE POST
##   Very Poor    320  161
##   Poor         268   74
##   Satisfactory 128  261
##   Good         180  294
##   Excellent    333  439

The trend of positive effect of the exercise plan is still present when we consider gender although it seems to be stronger for men than for women.

with(data_exer2, table(SRH, Time, Age_cat,trt))
## , , Age_cat = (18,34], trt = Treatment
## 
##               Time
## SRH            PRE POST
##   Very Poor      4    3
##   Poor           0    0
##   Satisfactory  35    0
##   Good          10    0
##   Excellent    606  652
## 
## , , Age_cat = (34,39], trt = Treatment
## 
##               Time
## SRH            PRE POST
##   Very Poor     67   37
##   Poor         108   46
##   Satisfactory 140   12
##   Good         231  284
##   Excellent     68  235
## 
## , , Age_cat = (39,44], trt = Treatment
## 
##               Time
## SRH            PRE POST
##   Very Poor    201   70
##   Poor         170   96
##   Satisfactory 121  270
##   Good         105  161
##   Excellent      0    0
## 
## , , Age_cat = (44,70], trt = Treatment
## 
##               Time
## SRH            PRE POST
##   Very Poor    391  314
##   Poor         218  125
##   Satisfactory   0  159
##   Good          25   36
##   Excellent      0    0

1.7 Graphical visualisation of the data

1.7.1 Visualising the categorical outcome SRH

We use mosaic plots to visualise the categorical outcome in relation to the other categorical variables in the study.

First, we simply consider SRH by gender for the different groups, before and after the treatment.

ggplot(data = data_exer2) +
  geom_mosaic(aes( x = product(SRH, Time), fill = SRH), na.rm = TRUE) + 
  theme(axis.text.x=element_text(angle = -90, hjust = .1, size = 4)) +
  facet_grid(.~trt, drop = TRUE) + 
  theme(axis.text.x = element_text(angle = 0, hjust = .5, size = 8)) +
  theme(axis.title = element_blank()) +
  theme(legend.position = "bottom") +
  ggtitle ('SRH for treatment and control groups')

It is clear that there is little difference between pre- and post- in the controls. There is a bigger yellow, green and cyan areas and smaller purple and blue areas in the treatments post-intervention. This indicates a benefitial effect of the treatment.

ggplot(data = data_exer2) +
  geom_mosaic(aes( x = product(SRH, Time), fill = SRH), na.rm = TRUE) + 
  theme(axis.text.x=element_text(angle = -90, hjust = .1, size = 4)) +
  facet_grid(trt~Sex, drop = TRUE) + 
  theme(axis.text.x = element_text(angle = 0, hjust = .5, size = 8)) +
  theme(axis.title = element_blank()) +
  theme(legend.position = "bottom") +
  ggtitle ('SRH by gender for treatment and control groups')

The positive, beneficial trend of the treament is still showing when we add gender.

In the following mosaic plots we visualise 5 variables (in 2-D) at the same time!

ggplot(data = data_exer2) +
  geom_mosaic(aes( x = product(SRH, Age_cat, Sex), fill = SRH), na.rm = TRUE) + 
  theme(axis.text.x=element_text(angle = -90, hjust = .1, size = 4)) +
  facet_grid(trt~Time, drop = TRUE) + 
  scale_x_productlist(breaks = c(0.25, 0.75), labels = c("FEMALE", "MALE")) +
  theme(axis.text.x = element_text(angle = 0, hjust = .5, size = 8)) +
  theme(axis.title.x = element_blank()) +
  labs(y = "Age Group") +
  theme(legend.position = "bottom") +
  ggtitle ('SRH by age and gender for treatment and control groups')

In the plot above we plotted contiguous mosaics for females and males, before and after the study. For the Treatment group, we see more yellow and less purple in the post mosaics than in the pre mosaics, for both genders and across age categories. This hints there might be a beneficial effect on SRH of the exercise programme. There is little change in the control group.


We may want to plot contiguous mosaics for pre- and post-study SRH by gender.

ggplot(data = data_exer2) +
  geom_mosaic(aes( x = product(SRH, Age_cat, Time), fill = SRH), na.rm = TRUE) + 
  theme(axis.text.x=element_text(angle = -90, hjust = .1, size = 4)) +
  facet_grid(trt~Sex, drop = TRUE) + 
  scale_x_productlist(breaks = c(0.25, 0.75), labels = c("PRE", "POST")) +
  theme(axis.text.x = element_text(angle = 0, hjust = .5, size = 8)) +
  theme(axis.title = element_blank()) +
  theme(legend.position = "bottom") +
  ggtitle ('SRH by age and gender for treatment and control groups')

We can replace Age by Race.

ggplot(data = subset(data_exer2, is.na(Race) == "FALSE")) +
  geom_mosaic(aes( x = product(SRH, Race, Time), fill = SRH), na.rm = TRUE) + 
  theme(axis.text.x = element_text(angle = -90, hjust = .1, size = 4)) +
  facet_grid(trt ~ Sex) + 
  scale_x_productlist(breaks = c(0.25, 0.75), labels = c("PRE", "POST")) +
  theme(axis.text.x = element_text(angle = 0, hjust = .5, size = 8)) +
  theme(axis.title = element_blank()) +
  theme(legend.position = "bottom") +
  ggtitle ('SRH by race and gender for treatment and control groups') +
  theme(legend.position = "bottom")

We see a slight trend of the exercise being beneficial for the treatment group, for both genders and across race groups, in terms of SRH. The post group has narrower purple areas and larger yellow areas. The cyan and green areas are wider post-treatment for males. Hispanic and black males, and also females, seem to benefit the most (look at the purple areas how they decrease post-treatment). Again, the effect seems to be stronger for males than females.

1.7.2 Visualising the continuous outcome weight

Let us visualise now the measured variables that are continuous, namely Weight. We plot POST_WEIGHT vs PRE_WEIGHT for males (blue) and females (red), before (solid line) and after (dotted line) the intervention. We also add a 45 degree line through zero (symbolising no treatment effect on weight, i.e. weight before is equal to weight after).

ggplot(data_exer1, aes(x = PRE_WEIGHT, y = POST_WEIGHT, col = Sex, linetype = trt)) +
  geom_point(alpha = 0.2) +
  stat_smooth(method = "loess", se = FALSE,lwd=1) +
  geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1)

We observe that for both sexes the curves of the control patients are nearly overlapping the no-effect green line, as expected. The treatment curve for women is very near the no-effect green line. There seems to be a very mild positive effect for women whose weights before the intervention were below 150 pounds. The intervention seems to be effective for weight loss for men who underwent the exercise programme, as the blue dotted curve is clearly below the no-effect green line.

Let us split the data into age categories.

ggplot(data_exer1, aes(x = PRE_WEIGHT, y = POST_WEIGHT, col = Sex, linetype = trt )) +
  geom_point(alpha = 0.2) +
  facet_grid(.~Age_cat) +
  stat_smooth(method = "loess", se = FALSE, lwd = 1) +
  geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1) +
  theme(legend.position = "bottom")

The observations we made before still hold when we split the group by age category. The effect of the exercise programme seems to be more beneficial for men aged 45 or older who weigh over 200 pounds.

Now, let us split the group by race

ggplot(data = subset(data_exer1,is.na(Race) == "FALSE"), aes(x = PRE_WEIGHT, y = POST_WEIGHT, col = Sex, linetype = trt )) +
  geom_point(alpha = 0.2) +
  facet_grid(.~Race) +
  stat_smooth(method = "loess", se = FALSE, lwd = 1) +
  geom_abline(intercept = 0, slope = 1, color = "green", lwd = 1) +
  theme(legend.position = "bottom")

We generally observe the same trends as before for women and for men although the treatment seems to be more effective for white men, specially for those who are overweight.



1.8 EXERCISE

Define proportional weight loss as prWL = (POST_WEIGHT - PRE_WEIGHT)/PRE_WEIGHT, that is the proportion of weight lost relative to initial weight. Values of prWL near zero are for little or no effect of the intervention, large negative values are for a positive effect for weight loss, as a percentage of the original weight.

  • Create the new column prWL and append it to data_exer1.

  • Explore the distribution of this new variable conditional on gender and treatment group using boxplots.

  • For the treatment group, explore the distribution of this new variable for each race group, conditional on gender and age.

The solution is below. If you would like to attempt the question before looking at the solution, do not scroll down.

























Solution

data_exer1 <-  mutate(data_exer1, prWL = (POST_WEIGHT - PRE_WEIGHT)/PRE_WEIGHT)# compute and append a new column prWL
glimpse(data_exer1)
## Observations: 2,500
## Variables: 11
## $ id          <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Sex         <fct> MALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMALE, FEMA...
## $ Age         <dbl> 41.2, 42.9, 38.5, 35.6, 48.5, 36.9, 28.9, 24.7, 47...
## $ Age_cat     <fct> (39,44], (39,44], (34,39], (34,39], (44,70], (34,3...
## $ Race        <fct> White, White, White, Hispanic, White, NA, White, H...
## $ trt         <fct> Treatment, Treatment, Treatment, Treatment, Treatm...
## $ PRE_SRH     <fct> Good, Poor, Satisfactory, Poor, Poor, Good, Excell...
## $ POST_SRH    <fct> Poor, Very Poor, Good, Good, Poor, Excellent, Exce...
## $ PRE_WEIGHT  <dbl> 135, 154, 128, 183, 166, 120, 158, 160, 127, 187, ...
## $ POST_WEIGHT <dbl> 125, 153, 115, 177, 163, 105, 151, 143, 119, 159, ...
## $ prWL        <dbl> -0.074074074, -0.006493506, -0.101562500, -0.03278...
ggplot(data_exer1, aes(x = trt, y = prWL, col = Sex)) +
  geom_boxplot()

For control patients there is little variation and, there is a small placebo effect in men. perhaps knowing that their weight was going to be monitored brought awareness to their diet and life style. For treatment patients we observe more variation in the proportional weight loss, with a marked more positive effect for men than for women. There are a few exceptional cases, where patients lose a large proportion of their initial weight, nearly 25% for some men, whereas other men gain weight, although not a large proportion of their initial weight. 50% of men lose about 9% or more of their initial weight. For women the effect is not as pronounced with a median proportional weight loss of about 2% and some exceptional cases gaining more than 10% of their original weight by the end of study (perhaps emotional eating due to anxiety of weight being monitored?).

Let us focus now only on treatment cases and see how the effectiveness of the intervention in terms of weight loss is related to the other variables.

subset(data_exer1, is.na(Race) == "FALSE" & trt == "Treatment") %>%
ggplot(aes(x = Age_cat, y = prWL, col = Sex)) +
  geom_boxplot() +
  scale_x_discrete("Age group") +
  scale_y_continuous("Weight lost as a proportion of initial weight") +
  facet_grid(.~ Race) +
  theme(legend.position = "bottom")

We can see in the previous plot, for example, that proportional weight loss seems to be more or less equally effective across age categories for white women, slightly more effective for older white, Asian and black men, but less effective for older Hispanic men. For black women the intervention seems to become more effective as age increases.

We conclude that the effect of the exercise programme seems to have a beneficial effect in both self-rated health and weight loss for many patients and that the effectiveness of the intervention is related to demographic characteristics such as age, sex and race.